Generating images from hand drawings is an essential and fundamental task in content creation. The translation is difficult because there exist infinite possibilities, and different users typically expect different outcomes. Therefore, we propose a unified framework supporting three-dimensional control over the image synthesis from sketches and strokes based on diffusion models. Users can decide not only the level of faithfulness to the input strokes and sketch, but also the degree of realism, as the user inputs are often not consistent with real images. Qualitative and quantitative experiments demonstrate that our framework achieves state-of-the-art performance while providing the flexibility to generate customized images with controlled shape, color, and realism. Moreover, our method unlocks applications such as editing on real images, generation from partial sketches and strokes, and multi-domain multi-modal synthesis.
Current image-to-image translation methods formulate the task with conditional generative models, which learn only recoloring or regional changes because they are constrained by the rich structural information provided by the conditional contexts. In this work, we propose introducing vector quantization techniques into the image-to-image translation framework. The vector-quantized content representation not only facilitates translation, but also allows the unconditional distribution to be shared across different domains. Meanwhile, combined with a disentangled style representation, the proposed method further enables image extension with flexibility in both intra-domain and inter-domain settings. Qualitative and quantitative experiments demonstrate that our framework achieves performance comparable to state-of-the-art image-to-image translation and image extension methods. Compared with methods designed for individual tasks, the proposed method, as a unified framework, unlocks applications that combine image-to-image translation, unconditional generation, and image extension. For example, it provides style variability for image generation and extension, and offers further extension capabilities for image-to-image translation.
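The core vector-quantization step the abstract refers to can be illustrated with a minimal sketch: each continuous content feature is replaced by its nearest entry in a learned codebook. The function name, shapes, and the random codebook below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each continuous feature vector to its nearest codebook entry.

    features: (N, D) array of content features
    codebook: (K, D) array of learned code vectors
    Returns the quantized features and the chosen code indices.
    """
    # Squared Euclidean distance between every feature and every code.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)  # index of the nearest code per feature
    return codebook[indices], indices

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K = 8 codes of dimension 4
features = rng.normal(size=(16, 4))  # 16 content features
quantized, idx = vector_quantize(features, codebook)
```

Because all domains index into the same discrete codebook, the quantized content representation can be shared by an unconditional prior, which is what enables the unified translation/generation/extension framework described above.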
Self-supervised learning has recently shown great potential in vision tasks through contrastive learning, which aims to discriminate each image, or instance, in the dataset. However, such instance-level learning ignores the semantic relationships between instances and sometimes undesirably repels the anchor from semantically similar samples, termed "false negatives". In this work, we show that the unfavorable effect of false negatives is more significant for large-scale datasets with more semantic concepts. To address the issue, we propose a novel self-supervised contrastive learning framework that incrementally detects and explicitly removes false negative samples. Specifically, following the training process, as the encoder gradually improves and the embedding space becomes more semantically structured, our method dynamically detects an increasing number of high-quality false negatives. Next, we discuss two strategies to explicitly remove the detected false negatives during contrastive learning. Extensive experiments show that our framework outperforms other self-supervised contrastive learning methods on multiple benchmarks in a limited-resource setup.
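One of the removal strategies the abstract mentions, eliminating suspected false negatives from the contrastive loss, can be sketched as follows. This is a toy illustration under stated assumptions: all embeddings are L2-normalized, and a fixed cosine-similarity threshold (a hypothetical parameter, `fn_threshold`) stands in for the paper's dynamic detection.

```python
import numpy as np

def info_nce_without_false_negatives(anchor, positive, negatives,
                                     tau=0.1, fn_threshold=0.9):
    """InfoNCE loss that drops suspected false negatives.

    Negatives whose cosine similarity to the anchor exceeds
    `fn_threshold` are treated as false negatives and removed from
    the denominator (the "elimination" strategy).
    All vectors are assumed L2-normalized; tau is the temperature.
    """
    pos_sim = anchor @ positive / tau
    neg_sims = negatives @ anchor / tau
    # Detect false negatives in the (unscaled) cosine-similarity space.
    keep = (negatives @ anchor) < fn_threshold
    denom = np.exp(pos_sim) + np.exp(neg_sims[keep]).sum()
    return -pos_sim + np.log(denom)

# Demo on random normalized embeddings.
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 32))
z /= np.linalg.norm(z, axis=1, keepdims=True)
loss = info_nce_without_false_negatives(z[0], z[1], z[2:])
```

Removing a near-duplicate of the anchor from the denominator lowers the loss, so the anchor is no longer pushed away from its semantic neighbors.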
In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision communities. The input for conditional image generation has evolved from image-only to multimodal. In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change objects. The input of the task is multimodal, consisting of (1) a reference image and (2) an instruction in natural language that describes the desired modifications to the image. We propose a GAN-based method to tackle this problem. The key idea is to treat text as neural operators that locally modify the image features. We show that the proposed model performs favorably against recent strong baselines on three public datasets. Specifically, it generates images of higher fidelity and semantic relevance, and when used as an image query, leads to better retrieval performance.
Diffusion models have achieved justifiable popularity by attaining state-of-the-art performance in generating realistic objects from seemingly arbitrarily complex data distributions, including when conditioning generation on labels. Unfortunately, however, their iterative nature renders them very computationally inefficient during the sampling process. For the multi-class conditional generation problem, we propose a novel, structurally unique framework of diffusion models which are hierarchically branched according to the inherent relationships between classes. In this work, we demonstrate that branched diffusion models offer major improvements in efficiently generating samples from multiple classes. We also showcase several other advantages of branched diffusion models, including ease of extension to novel classes in a continual-learning setting, and a unique interpretability that offers insight into these generative models. Branched diffusion models represent an alternative paradigm to their traditional linear counterparts, and can have large impacts in how we use diffusion models for efficient generation, online learning, and scientific discovery.
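The branching idea can be sketched schematically: during reverse diffusion, the high-noise timesteps are handled by a denoiser shared across classes, and only the later, low-noise timesteps dispatch to class-specific branches. Everything below (the linear "denoisers", the `branch_point` parameter, the loop structure) is a hypothetical toy, not the paper's models.

```python
import numpy as np

def branched_sample(cls, T, branch_point, trunk, branches, dim, rng):
    """Toy ancestral-sampling loop: from t = T-1 down to 0, apply the
    shared trunk denoiser for high-noise steps (t >= branch_point) and
    the class-specific branch denoiser for the remaining steps."""
    x = rng.normal(size=dim)  # start from pure noise
    for t in reversed(range(T)):
        denoise = trunk if t >= branch_point else branches[cls]
        x = denoise(x, t)
    return x

# Hypothetical linear "denoisers" for illustration only.
trunk = lambda x, t: 0.9 * x
branches = {0: lambda x, t: 0.9 * x + 0.1,
            1: lambda x, t: 0.9 * x - 0.1}
rng = np.random.default_rng(0)
sample = branched_sample(0, T=10, branch_point=5, trunk=trunk,
                         branches=branches, dim=3, rng=rng)
```

Because the trunk is shared, adding a new class only requires training a new (short) branch, which is the continual-learning advantage the abstract highlights.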
We develop a Synthetic Fusion Pyramid Network (SPF-Net) with a scale-aware loss function design for accurate crowd counting. Existing crowd-counting methods assume that the training annotation points were accurate and thus ignore the fact that noisy annotations can lead to large model-learning bias and counting error, especially for counting highly dense crowds that appear far away. To the best of our knowledge, this work is the first to properly handle such noise at multiple scales in end-to-end loss design and thus push the crowd counting state-of-the-art. We model the noise of crowd annotation points as a Gaussian and derive the crowd probability density map from the input image. We then approximate the joint distribution of crowd density maps with the full covariance of multiple scales and derive a low-rank approximation for tractability and efficient implementation. The derived scale-aware loss function is used to train the SPF-Net. We show that it outperforms various loss functions on four public datasets: UCF-QNRF, UCF CC 50, NWPU, and ShanghaiTech A-B. The proposed SPF-Net can accurately predict the locations of people in the crowd, despite training on noisy training annotations.
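The first modeling step, treating each annotation point as a Gaussian and deriving a density map whose integral equals the head count, can be sketched as below. The function name, grid shapes, and fixed `sigma` are illustrative assumptions; the paper's scale-aware covariance treatment is not reproduced here.

```python
import numpy as np

def gaussian_density_map(points, height, width, sigma=4.0):
    """Model each annotation point as a 2-D isotropic Gaussian and sum
    them into a crowd density map. Each Gaussian is normalized over the
    grid, so the map integrates to the number of annotated people."""
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width))
    for (px, py) in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        g /= g.sum()  # each person contributes total mass 1
        density += g
    return density

# Two annotated heads on a 48x64 grid.
dmap = gaussian_density_map([(10, 12), (30, 25)], height=48, width=64)
```

Summing the predicted density map then yields the crowd count, which is why annotation noise modeled through the Gaussian width directly affects the loss.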
The increased importance of mobile photography created a need for fast and performant RAW image processing pipelines capable of producing good visual results in spite of the mobile camera sensor limitations. While deep learning-based approaches can efficiently solve this problem, their computational requirements usually remain too large for high-resolution on-device image processing. To address this limitation, we propose a novel PyNET-V2 Mobile CNN architecture designed specifically for edge devices, able to process RAW 12MP photos directly on mobile phones in under 1.5 seconds while producing high perceptual photo quality. To train and evaluate the performance of the proposed solution, we use the real-world Fujifilm UltraISP dataset consisting of thousands of RAW-RGB image pairs captured with a professional medium-format 102MP Fujifilm camera and a popular Sony mobile camera sensor. The results demonstrate that the PyNET-V2 Mobile model can substantially surpass the quality of traditional ISP pipelines, while outperforming the previously introduced neural network-based solutions designed for fast image processing. Furthermore, we show that the proposed architecture is also compatible with the latest mobile AI accelerators such as NPUs or APUs, which can be used to further reduce the latency of the model to as little as 0.5 seconds. The dataset, code and pre-trained models used in this paper are available on the project website: https://github.com/gmalivenko/PyNET-v2
A new and efficient neural-network and finite-difference hybrid method is developed for solving the Poisson equation in a regular domain with jump discontinuities on embedded irregular interfaces. Since the solution has low regularity across the interface, when applying finite difference discretization to this problem, an additional treatment accounting for the jump discontinuities must be employed. Here, we aim to elevate such an extra effort to ease our implementation by machine learning methodology. The key idea is to decompose the solution into singular and regular parts. The neural network learning machinery incorporating the given jump conditions finds the singular solution, while the standard finite difference method is used to obtain the regular solution with associated boundary conditions. Regardless of the interface geometry, these two tasks only require supervised learning for function approximation and a fast direct solver for the Poisson equation, making the hybrid method easy to implement and efficient. The two- and three-dimensional numerical results show that the present hybrid method preserves second-order accuracy for the solution and its derivatives, and it is comparable with the traditional immersed interface method in the literature. As an application, we solve the Stokes equations with singular forces to demonstrate the robustness of the present method.
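The singular/regular decomposition can be summarized in a short sketch. The notation below ($\Gamma$ for the embedded interface, $v,w$ for the prescribed jumps, $g$ for the boundary data) is introduced here for illustration and is not taken from the paper.

```latex
% Poisson problem with jumps across an embedded interface \Gamma:
%   \Delta u = f \text{ in } \Omega, \quad u = g \text{ on } \partial\Omega,
%   [u] = v, \quad [\partial_n u] = w \text{ on } \Gamma.
% Decompose the solution into singular and regular parts:
u = u_{\mathrm{sing}} + u_{\mathrm{reg}},
\qquad
\begin{aligned}
& u_{\mathrm{sing}}:\ \text{network output trained so that }
  [u_{\mathrm{sing}}] = v,\ [\partial_n u_{\mathrm{sing}}] = w \text{ on } \Gamma,\\
& \Delta u_{\mathrm{reg}} = f - \Delta u_{\mathrm{sing}} \text{ in } \Omega,
  \qquad u_{\mathrm{reg}} = g - u_{\mathrm{sing}} \text{ on } \partial\Omega.
\end{aligned}
```

Since $u_{\mathrm{reg}}$ is smooth across $\Gamma$ once the jumps are absorbed by $u_{\mathrm{sing}}$, the standard second-order finite-difference stencil applies without any interface correction, which is why only a fast direct Poisson solver is needed.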
To achieve accurate 3D object detection at low cost for autonomous driving, many multi-camera methods have been proposed that solve the occlusion problem of monocular approaches. However, due to the lack of accurate depth estimation, existing multi-camera methods often generate multiple bounding boxes along the depth direction for difficult small objects such as pedestrians, resulting in extremely low recall. Furthermore, directly applying depth prediction modules to existing multi-camera methods, which usually consist of large network architectures, cannot meet the real-time requirements of autonomous driving applications. To address these issues, we propose Cross-view and Depth-guided Transformers for 3D object detection, CrossDTR. First, our lightweight depth predictor is designed to produce precise object-wise sparse depth maps and low-dimensional depth embeddings without requiring extra depth datasets during supervision. Second, a cross-view depth-guided transformer is developed to fuse the depth embeddings with image features from cameras of different views and to generate 3D bounding boxes. Extensive experiments show that our method greatly surpasses existing methods by 10 percent in pedestrian detection and about 3 percent in overall mAP and NDS metrics. Also, computational analyses show that our method is 5 times faster than prior approaches. Our code will be made publicly available at https://github.com/sty61010/crossdtr.